Part 2

In Part 1, we discussed 3 elements of the grammar of graphics: data, aesthetics, & geometries. We will continue building our understanding of data viz by focusing our attention on other important layers:

Statistics

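Some statistics functions and geom functions can be used interchangeably in ggplot2. For example, geom_histogram and geom_freqpoly both use the stat_bin function under the hood to bin the data before plotting it, so calling stat_bin directly produces the same histogram. geom_bar, by contrast, uses stat_count, which counts the observations at each distinct x value rather than binning them.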

#assign plot object
p <- ggplot(iris, aes(x = Sepal.Width))

#plot with geom_histogram
p + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#plot with geom_bar
p + geom_bar()

#plot with stat_bin
p + stat_bin()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
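
One way to check which statistic sits behind each geom is to compare the data each layer computes. The sketch below uses ggplot2's layer_data() helper for this; the object names (hist_data, bin_data, bar_data) are only illustrative, and bins = 30 is set explicitly to silence the message shown above.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Width))

# layer_data() returns the data frame a layer computes before it is drawn
hist_data <- layer_data(p + geom_histogram(bins = 30))
bin_data  <- layer_data(p + stat_bin(bins = 30))
bar_data  <- layer_data(p + geom_bar())

# geom_histogram and stat_bin share a statistic, so the per-bin counts should agree
identical(hist_data$count, bin_data$count)

# geom_bar counts each distinct x value instead of binning
nrow(bar_data) == length(unique(iris$Sepal.Width))
```

If both comparisons return TRUE, the histogram and stat_bin layers are computing the same thing, while geom_bar is producing one bar per distinct Sepal.Width value.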

Similarly, geom_smooth draws the smoothed fits computed by the stat_smooth function.

#assign plot object
p <- ggplot(iris, aes(Petal.Length, Sepal.Length, color = factor(Species)))

#scatter plot with least squares modeling for each individual Species, and the dataset as a whole.
#we can determine whether the confidence interval ribbon appears by setting the 'se' argument.
p + geom_point() +
  geom_smooth(method = "lm") +
  geom_smooth(method = "lm", se = FALSE, aes(group = 1))

LOESS smoothing is a non-parametric form of regression that fits a weighted local regression within a sliding window to build up a smooth curve. We can control the size of this window with the span argument.

#set individual models to loess (default) and adjust the span
p + 
  geom_point() +
  geom_smooth(se = FALSE, span = 0.7)
## `geom_smooth()` using method = 'loess'
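
To get a feel for what span does outside of a plot, the sketch below fits stats::loess() directly with two window sizes. It fits a single model to the whole dataset rather than one per Species, and the span values 0.5 and 1 are arbitrary choices for illustration.

```r
# fit the same relationship with a narrower and a wider window
narrow <- loess(Sepal.Length ~ Petal.Length, data = iris, span = 0.5)
wide   <- loess(Sepal.Length ~ Petal.Length, data = iris, span = 1)

# enp is the equivalent number of parameters of each fit;
# a larger value indicates a more flexible (wigglier) curve
c(narrow = narrow$enp, wide = wide$enp)
```

A smaller span uses fewer neighboring points for each local fit, so the curve follows the data more closely; a larger span averages over more points and gives a smoother line.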

#add overall model layer (loess) change individual model layers to 'lm'.
p + 
  geom_point()+
  geom_smooth(method = "lm") +
  stat_smooth(aes(group = 1), method = "loess", se = FALSE, col = "black")

Notice in the plot above that I used both geom_smooth and stat_smooth. As mentioned before, these functions are interchangeable.

Another nice feature of the smoothing functions is that you can extend the model to the full range of the plot by setting the logical fullrange argument (TRUE or FALSE). Notice how the se ribbon gets wider the further it extends from the data.

#apply fullrange of predictions for the individual Species linear regression models.
p + 
  geom_point()+
  geom_smooth(method = "lm", fullrange = TRUE) +
  stat_smooth(aes(group = 1), method = "loess", se = FALSE, col = "black")
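
The widening ribbon can be reproduced numerically. The sketch below fits the same kind of linear model to one Species with lm() and asks predict() for standard errors at points progressively further from the mean of the observed x values; the offsets (0, 1, 3, 5) are arbitrary and chosen only to step away from the data.

```r
# linear model for a single Species, as geom_smooth(method = "lm") would fit per group
setosa <- subset(iris, Species == "setosa")
fit <- lm(Sepal.Length ~ Petal.Length, data = setosa)

# predict at the mean of the observed x values and at points increasingly far from it
x_mean <- mean(setosa$Petal.Length)
new_x <- data.frame(Petal.Length = x_mean + c(0, 1, 3, 5))
pred <- predict(fit, newdata = new_x, se.fit = TRUE)

# the standard error of the fitted line grows with distance from the data
data.frame(new_x, se = pred$se.fit)
```

The standard error is smallest at the mean of the predictor and grows as predictions move away from it, which is why the ribbon flares out when fullrange extends the line beyond the observed points.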

In the plot above, the overall model is not included in the legend even though we applied the attribute color = "black" to it. Legends are only drawn for aesthetics mapped inside aes(), not for attributes set outside of it. We can fix this by adding the color as an aesthetic, but we lose our control over the color.

#add color as an aesthetic named 'All'
p + 
  geom_point()+
  geom_smooth(method = "lm") +
  stat_smooth(aes(group = 1, color = "All"), method = "loess", se = FALSE)

Now the ‘All’ model appears in the legend but as I mentioned, we lost control over the color. We can fix this!

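# create a color vector with 4 colors: one for the 'All' model and one for each Species
# note: recent wesanderson releases name this palette 'Darjeeling1' (older releases used 'Darjeeling')
colors <- c("black", wesanderson::wes_palette(name = 'Darjeeling1', 3))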

#add manual color scale to change the colors.
p + 
  geom_point()+
  geom_smooth(method = "lm") +
  stat_smooth(aes(group = 1, color = "All"), method = "loess", se = FALSE)+
  scale_color_manual("Species Colors", values = colors)
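
The unnamed vector above relies on the order in which ggplot2 sorts the legend entries. A slightly more explicit variation, sketched below, names each color so the mapping does not depend on that order. The specific base R color names and the object names (species_colors, named_plot, overall) are assumptions made for illustration and stand in for the wesanderson palette.

```r
library(ggplot2)

# named vector: each legend entry is tied to its color explicitly
species_colors <- c(All = "black",
                    setosa = "firebrick",
                    versicolor = "darkorange",
                    virginica = "steelblue")

p <- ggplot(iris, aes(Petal.Length, Sepal.Length, color = Species))

named_plot <- p +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  stat_smooth(aes(group = 1, color = "All"), method = "loess",
              formula = y ~ x, se = FALSE) +
  scale_color_manual("Species Colors", values = species_colors)

# the third layer is the overall model; check the color it ends up with
overall <- layer_data(named_plot, 3)
unique(overall$colour)
```

With named values, adding or renaming a group later cannot silently shift the colors of the other groups.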

With stat_quantile(), we can apply quantile regression to a dataset. By default, the 1st, 2nd, and 3rd quartiles (the 0.25, 0.5, and 0.75 quantiles) are modeled as a response to the predictor variable. Specific quantiles can be requested with the quantiles argument. For example, to show only the median, we can set quantiles = 0.5.

#examine the dataset
str(txhousing)
## Classes 'tbl_df', 'tbl' and 'data.frame':    8602 obs. of  9 variables:
##  $ city     : chr  "Abilene" "Abilene" "Abilene" "Abilene" ...
##  $ year     : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ month    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sales    : num  72 98 130 98 141 156 152 131 104 101 ...
##  $ volume   : num  5380000 6505000 9285000 9730000 10590000 ...
##  $ median   : num  71400 58700 58100 68600 67300 66900 73500 75000 64500 59300 ...
##  $ listings : num  701 746 784 785 794 780 742 765 771 764 ...
##  $ inventory: num  6.3 6.6 6.8 6.9 6.8 6.6 6.2 6.4 6.5 6.6 ...
##  $ date     : num  2000 2000 2000 2000 2000 ...
#create plot object sales vs. listings
p <- ggplot(txhousing, aes(x = listings, y = sales))

#scatterplot with quantile models
p + geom_point() + stat_quantile()
## Warning: Removed 1426 rows containing non-finite values (stat_quantile).
## Loading required package: SparseM
## 
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
## 
##     backsolve
## Smoothing formula not specified. Using: y ~ x
## Warning: Removed 1426 rows containing missing values (geom_point).

#group by year
p + 
  geom_point() +
  stat_quantile(aes(color = year))
## Warning: Removed 1426 rows containing non-finite values (stat_quantile).
## Smoothing formula not specified. Using: y ~ x
## Warning: Removed 1426 rows containing missing values (geom_point).

Changing the color aesthetic did not produce the desired effect. That’s because the year variable is an integer, which ggplot2 treats as continuous, so it does not split the data into one group per year. We need it to be a factor.
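
The grouping behavior can be inspected directly. The sketch below uses geom_point as a stand-in layer (so it does not need the quantile regression package) and counts the groups ggplot2 creates for an integer versus a factor color mapping; int_groups and fct_groups are illustrative names.

```r
library(ggplot2)

p <- ggplot(txhousing, aes(x = listings, y = sales))

# group assignments ggplot2 derives from each color mapping
int_groups <- layer_data(p + geom_point(aes(color = year)))$group
fct_groups <- layer_data(p + geom_point(aes(color = factor(year))))$group

# a continuous mapping leaves everything in one group;
# a factor mapping creates one group per year
length(unique(int_groups))
length(unique(fct_groups))
```

Because statistics are computed once per group, a single group yields a single set of quantile lines, while the factor mapping yields one set per year.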

#make year a factor and adjust aesthetics/attributes
p +
  geom_point() +
  stat_quantile(aes(color = factor(year)), alpha = 0.6, size = 2)
## Warning: Removed 1426 rows containing non-finite values (stat_quantile).
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Warning: Removed 1426 rows containing missing values (geom_point).

While it’s pretty, this plot is messy and not very readable. Let’s clean it up by limiting which quantiles are plotted and adjusting our color scale to something more intuitive.

Even though we made year a factor, time is really a continuous variable, so we want to treat it as such when we choose our color scale. We can do this by mapping color = year, which gives a continuous color scale, while keeping a separate quantile model for each year with group = factor(year).

#create plot object as before. Add color and group aesthetics
p <- ggplot(txhousing, aes(x = listings, y = sales, color = year, group = factor(year)))

#Plot point and quantile models for the median quantile. Modify the color scheme.
colors <- RColorBrewer::brewer.pal(11, 'RdYlBu')

p + geom_point(color = "black", size = .75) +
  stat_quantile(alpha = 0.75, size = 2, quantiles = 0.5) +
  scale_color_gradientn(colours = colors)
## Warning: Removed 1426 rows containing non-finite values (stat_quantile).
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Smoothing formula not specified. Using: y ~ x
## Warning: Removed 1426 rows containing missing values (geom_point).

The stat_sum function is useful for revealing overplotting: it counts the observations at each unique location and maps that count to point size.

#create diamonds plot object clarity vs. cut
p <- ggplot(diamonds, aes(cut, clarity))

#make scatterplot
p + geom_point()

#reveal overplotting
p + geom_jitter(width = 0.3)

#apply stat_sum 
p + stat_sum()

#adjust scale_size
p + stat_sum() +
  scale_size(range = c(1,10))
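
To see the counts that stat_sum computes behind the sized points above, the layer's data can be pulled out with layer_data(). This is a small sketch; sum_data is an illustrative name, and the n and prop columns are the computed variables stat_sum provides.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(cut, clarity))

# one row per unique cut/clarity location, with the count in column n
sum_data <- layer_data(p + stat_sum())
head(sum_data[, c("x", "y", "n", "prop")])

# the counts should add up to the number of rows in the dataset
sum(sum_data$n) == nrow(diamonds)
```

The n column is what gets mapped to point size, which is why scale_size can be used to control how large the points become.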

Stat Summary

stat_summary can be used to compute summary statistics in conjunction with various ggplot2 and Hmisc functions.

For example, the mean_cl_normal function can be used to generate the mean and the lower and upper confidence limits for a variable.

# examine the Rabbit dataset (from the MASS package)
str(Rabbit)
## 'data.frame':    60 obs. of  5 variables:
##  $ BPchange : num  0.5 4.5 10 26 37 32 1 1.25 4 12 ...
##  $ Dose     : num  6.25 12.5 25 50 100 200 6.25 12.5 25 50 ...
##  $ Run      : Factor w/ 10 levels "C1","C2","C3",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ Treatment: Factor w/ 2 levels "Control","MDL": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Animal   : Factor w/ 5 levels "R1","R2","R3",..: 1 1 1 1 1 1 2 2 2 2 ...
#create plot object BPchange vs. Dose using Rabbit dataset
p <- ggplot(Rabbit, aes(x = factor(Dose), y = BPchange, color = Treatment))

#take a look at the plot
p + geom_point(position = position_jitter(0.2))

#use stat_summary function to generate the mean BPchange for each Dose and Treatment
p +
  stat_summary(geom = 'point', fun.y = mean)

#examine mean_cl_normal function
mean_cl_normal(Rabbit$BPchange)
##          y     ymin     ymax
## 1 11.21833 8.253206 14.18346
#assign position dodge function
posn.d <- position_dodge(width = 0.5)

#use stat_summary to generate confidence intervals of the mean
p + stat_summary(geom = 'errorbar', position = posn.d, fun.data = mean_cl_normal, size = 1) +
  stat_summary(geom = 'point', position = posn.d, fun.y = mean, shape = "X", size = 3)
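
For intuition about what mean_cl_normal returns, the sketch below computes a t-based confidence interval of the mean by hand. It assumes the usual formula (mean plus or minus a t quantile times the standard error) and does not call Hmisc; mean_ci_t is a hypothetical helper written for this illustration, not part of ggplot2 or Hmisc.

```r
# hypothetical helper: t-based confidence interval of the mean,
# returned with the column names stat_summary expects from fun.data
mean_ci_t <- function(x, conf = 0.95) {
  n <- length(x)
  half_width <- qt((1 + conf) / 2, df = n - 1) * sd(x) / sqrt(n)
  data.frame(y = mean(x),
             ymin = mean(x) - half_width,
             ymax = mean(x) + half_width)
}

# apply it to the same variable summarized above
ci <- mean_ci_t(MASS::Rabbit$BPchange)
ci
```

Because the helper returns y, ymin, and ymax, it could be passed to the fun.data argument in the same way as mean_cl_normal.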

You can even create custom functions to generate stats. The only caveat is that the names of the columns the function returns need to match the aesthetics of the geometry being called (for example y, ymin, and ymax).

#create min and max range function
range_function <- function(x){
  data.frame(ymin = min(x),
             ymax = max(x))
}

#demonstrate range function
range_function(Rabbit$BPchange)
##   ymin ymax
## 1  0.5   37
#create median interquartile range function (calculates the median, 25% and 75% quartiles)
med_IQR <- function(x){
  data.frame( y = median(x),
              ymin = quantile(x)[2],
              ymax = quantile(x)[4])
}

#demonstrate quantile and med_IQR function
quantile(Rabbit$BPchange)
##    0%   25%   50%   75%  100% 
##  0.50  1.65  4.75 20.50 37.00
med_IQR(Rabbit$BPchange)
##        y ymin ymax
## 25% 4.75 1.65 20.5
#use functions in stat_summary to plot the data
#redundancy of Treatment variable is necessary to adjust the attributes of the stat_summary functions 
p <- ggplot(Rabbit, aes(x = factor(Dose), y = BPchange, color = Treatment, group = Treatment, fill = Treatment))

p + 
  stat_summary(geom = 'linerange', fun.data = med_IQR, size = 2, position = posn.d) +
  stat_summary(geom = 'linerange', fun.data = range_function, size = 1.5, alpha = 0.5, position = posn.d)+
  stat_summary(geom = 'point', fun.y = median, shape = "X", color = 'black', size = 2, position = posn.d)
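
As an alternative to writing a function that returns a data frame, stat_summary also accepts separate functions for the center and the two ends of the interval. The sketch below assumes ggplot2 3.3.0 or later, where the fun, fun.min, and fun.max arguments are available and fun.y is deprecated in favor of fun; range_plot and range_data are illustrative names.

```r
library(ggplot2)

p <- ggplot(MASS::Rabbit,
            aes(x = factor(Dose), y = BPchange, color = Treatment, group = Treatment))

# min, median, and max are computed per Dose and Treatment
range_plot <- p +
  stat_summary(geom = "linerange", fun = median, fun.min = min, fun.max = max,
               position = position_dodge(width = 0.5))

# inspect the summarized values behind the line ranges
range_data <- layer_data(range_plot)
range_data[, c("x", "group", "ymin", "y", "ymax")]
```

This form is convenient when the summary is built from simple existing functions; the data frame approach shown above remains useful when the three values need to be computed together.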

Coordinates & Facets

You can adjust the scales in various ways. There is a series of functions beginning with scale that allow you to set the breaks and limits, among other arguments. There are also coordinate functions that allow you to manipulate the scales. It is important to understand the consequences of these functions, as they can lead to different plotting results.

For instance, you can zoom in on a section of a plot using the scale_x_continuous function, but this discards the data outside the limits before any statistics are computed, which may cut off portions of your plot as in the example below.

#create plot object using iris dataset
p <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  geom_smooth()

#view plot
p
## `geom_smooth()` using method = 'loess'

#zoom in using scale function
p + scale_x_continuous(limits = c(3.5, 5.5))
## `geom_smooth()` using method = 'loess'
## Warning: Removed 91 rows containing non-finite values (stat_smooth).
## Warning: Removed 91 rows containing missing values (geom_point).

In the plot above, the loess smoothing for the virginica species does not appear because only one virginica data point falls within the limits we set.

What we really want is a zoomed-in snapshot of this section of the plot as it is. For this we can use the coord_cartesian function, which adjusts the visible region without discarding any of the underlying data.

#apply coord_cartesian to plot object
p + coord_cartesian(xlim = c(3.5,5.5))
## `geom_smooth()` using method = 'loess'
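
The difference between the two approaches shows up in the data each plot carries. The sketch below uses a points-only plot (no smoother) to keep things simple and compares the built layer data under scale limits and under coord_cartesian; base, clipped, and zoomed are illustrative names.

```r
library(ggplot2)

base <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point()

# scale limits: values outside the limits are replaced with NA
clipped <- layer_data(base + scale_x_continuous(limits = c(3.5, 5.5)))

# coord_cartesian: the data are untouched, only the view changes
zoomed <- layer_data(base + coord_cartesian(xlim = c(3.5, 5.5)))

sum(is.na(clipped$x))
sum(is.na(zoomed$x))
```

The number of missing x values under the scale limits should correspond to the rows reported as removed in the warning above, while coord_cartesian should report none.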

As a rule of thumb, it is good practice to use a 1:1 aspect ratio when both axes are measured on the same scale.

The following plot uses the crabs dataset from the MASS package. CW refers to carapace width (mm) and RW refers to rear width (mm). The carapace is the upper section of the shell of the crab.

#create plot object using the crabs dataset from the MASS package
p <- ggplot(crabs, aes(x = CW, y = RW, color = factor(sp), label = factor(sp), linetype = sex)) +
  geom_text() + stat_smooth(method = "lm") +
  scale_color_manual("species", values = c("steelblue3","darkorange"))

#plot with default aspect ratio
p

#plot with fixed 1:1 aspect ratio
p + coord_equal()

The coord_polar function converts planar x-y Cartesian plots into polar coordinates. This can be useful for making pie charts. In general, though, it is best practice to avoid pie charts, as other plots can capture the same information in much more meaningful ways.

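#create a single stacked bar of clarity counts
p <- ggplot(diamonds, aes(x = 1, fill = clarity))+
  geom_bar(width = 2)

#map the bar heights (y) to the angle to turn the stacked bar into a pie chart
p + coord_polar(theta = 'y')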
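
As one possible alternative to the pie chart, the same clarity counts can be shown as a plain bar chart, where the categories are compared by bar height rather than by angle. This is only a sketch of one option; bar_plot and bar_counts are illustrative names.

```r
library(ggplot2)

# one bar per clarity level; the legend is dropped because the x axis already labels the bars
bar_plot <- ggplot(diamonds, aes(x = clarity, fill = clarity)) +
  geom_bar(show.legend = FALSE)

bar_plot

# the heights of the bars are the counts per clarity level
bar_counts <- layer_data(bar_plot)$count
bar_counts
```

Differences between neighboring categories that are hard to judge as pie slices tend to be easier to read from aligned bars.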